Broadband Outage Detection

Nadeem [3562] Batch : 78 INSOFE Bangalore

Problem Description:

India is seeing an explosion of new competitors in the broadband space. 'India Broadband' is a company that is now seeing a lot of customer churn caused by dissatisfaction over broadband outages. The company has curated a dataset that tracks several variables it believes impact the outage_duration. Three outage durations are tracked: 0 for no outage, 1 for short outages that last anywhere between a few minutes and a maximum of 2 hours, and 2 for long outages that can last from 2 hours to sometimes even a couple of days. You will use the metrics the company has tracked to build a machine learning model that predicts outage_duration, so the company can handle outages better, improve customer satisfaction, and thereby reduce customer churn.

Exploratory Data Analysis

Importing the required libraries

In [20]:
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve, GridSearchCV
from sklearn.ensemble import ExtraTreesClassifier
import xgboost as xgb
import operator
from scipy.stats import uniform as sp_rand
from scipy.stats import randint as sp_randint
import timeit
import matplotlib.pyplot as plt  
%matplotlib inline 
import seaborn as sns
sns.set_style("whitegrid")
tic0=timeit.default_timer()
pd.options.mode.chained_assignment = None  # default='warn'

There are 7 CSV files provided to us, they are described below:

train_data.csv: Contains a unique event id for each observation of the outage_duration in a particular area_code.

test_data.csv: Similar to the train dataset; we are given an id and an area_code and are expected to predict the outage_duration for each record. (This will be provided to you later, on 4th July.)

broadband_data.csv: For each of the event ids in train_data.csv and test_data.csv (plus some additional ids), records the broadband_type. There are 10 different broadband types observed in the dataset.

outage_data.csv: For each of the event ids in train_data.csv and test_data.csv (plus some additional ids), records the outage_type. There are 5 different outage_types recorded in the dataset.

report_data.csv: For each event id, records the log_report_type and volume columns. log_report_type is the type of report generated by a technical team member after evaluating the outage; volume is the volume of data handled in the area at the time of the report, in custom company-specific units.

server_data.csv: For each of the event ids in train_data.csv and test_data.csv (plus some additional ids), records the transit_server_type. Transit servers handle the requests and responses of the customers.

sample_submission.csv: The format of the CSV file required for submission to the evaluation backend. (The prediction file you upload to the tool to check your score must match this file's format.)
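Since the features live in separate files keyed by event id, a typical first step is to left-join the auxiliary tables onto the training ids. A minimal sketch using toy stand-in frames (the column names follow the descriptions above; the values are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the real CSVs (column names assumed from the descriptions above)
train = pd.DataFrame({'id': [1, 2, 3],
                      'area_code': ['area_1', 'area_2', 'area_1'],
                      'outage_duration': [0, 1, 2]})
broadband = pd.DataFrame({'id': [1, 2, 3, 4],
                          'broadband_type': ['bb_1', 'bb_2', 'bb_1', 'bb_3']})
server = pd.DataFrame({'id': [1, 2, 3, 4],
                       'transit_server_type': ['ts_1', 'ts_1', 'ts_2', 'ts_2']})

# Left-join so the extra ids present only in the auxiliary files are dropped
merged = (train.merge(broadband, on='id', how='left')
               .merge(server, on='id', how='left'))
print(merged.shape)   # (3, 5): the 3 train columns plus one feature from each table
```

The same pattern would extend to outage_data and server_data; report_data needs aggregating first because it can hold several rows per id.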

Reading the data sets

In [2]:
train_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/train_data.csv')
broadband_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/broadband_data.csv')
outage_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/outage_data.csv')
report_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/report_data.csv')
server_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/server_data.csv')
sample_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/sample_submission.csv')

Looking into the shapes of all the data sets

In [3]:
train_data.shape, broadband_data.shape, outage_data.shape, report_data.shape, server_data.shape
Out[3]:
((5904, 3), (21076, 2), (18552, 2), (58671, 3), (31170, 2))

Printing the shapes of all data sets, including the sample submission

In [4]:
print('The shape of broadband is: {}\n'.format(broadband_data.shape))
print('The shape of outage is: {}\n'.format(outage_data.shape))
print('The shape of report is: {}\n'.format(report_data.shape))
print('The shape of server is: {}\n'.format(server_data.shape))
print('The shape of train is: {}\n'.format(train_data.shape))
print('The shape of sample is: {}\n'.format(sample_data.shape))
The shape of broadband is: (21076, 2)

The shape of outage is: (18552, 2)

The shape of report is: (58671, 3)

The shape of server is: (31170, 2)

The shape of train is: (5904, 3)

The shape of sample is: (1477, 2)

                 ---------------------------------------------------------------------

train_data.csv: Contains a unique event id for each observation of the outage_duration in a particular area_code.

In [5]:
train_data.head()
Out[5]:
id area_code outage_duration
0 13366 area_415 1
1 6783 area_474 0
2 9519 area_931 1
3 10202 area_700 1
4 4555 area_600 2

Train Data has three columns:

  • ID
  • Area Code
  • Outage Duration
In [6]:
train_data['area_code'].value_counts()
Out[6]:
area_821     67
area_1107    61
area_126     59
area_734     57
area_1008    57
             ..
area_38       1
area_334      1
area_99       1
area_1101     1
area_589      1
Name: area_code, Length: 876, dtype: int64

Value counts of the train data: area_821 is the most frequently repeated area code in the data

In [7]:
counts = train_data['area_code'].value_counts()
total = counts.sum()
for area, count in counts.items():
    print("Area Code", area, round(count / total * 100), '%')
Area Code area_821 1 %
Area Code area_1107 1 %
Area Code area_126 1 %
Area Code area_734 1 %
Area Code area_1008 1 %
... (output truncated: each of the remaining 871 area codes accounts for 1% or less of the records)
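The same per-area percentages can be computed in one vectorized call with `value_counts(normalize=True)`, avoiding the Python loop. A small sketch on a toy series (the real call would be on `train_data['area_code']`):

```python
import pandas as pd

# Toy stand-in for train_data['area_code']
s = pd.Series(['area_821'] * 3 + ['area_1107'] * 2 + ['area_126'])

# normalize=True returns fractions; scale to percentages and round like the loop above
pct = s.value_counts(normalize=True).mul(100).round()
print(pct.to_dict())   # {'area_821': 50.0, 'area_1107': 33.0, 'area_126': 17.0}
```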

Plot for Outage Duration

In [8]:
#count plot for Outage Duration 

plt.figure(figsize = (14,6))
sns.countplot(x='outage_duration', data=train_data)
plt.tight_layout()
plt.show()

There are three different outage durations:

  • 0 for no outage
  • 1 for short outages that last anywhere between a few minutes and a maximum of 2 hours
  • 2 for long outages that can last from 2 hours to sometimes even a couple of days

Plot with ID vs Area Code

In [9]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data['area_code'],
                 y=train_data['id'],
                 mode="markers",text=train_data["area_code"])]
layout=go.Layout(title="ID vs Area_code",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

From the plot, most IDs are concentrated in particular areas; as the graph shows, the number of users drops off in some specific areas

                 ---------------------------------------------------------------------

Plot with ID and Outage Durations.

  • All three outage durations
In [10]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==0,'area_code'],
                 y=train_data.loc[train_data.outage_duration==0,'id'],
                 mode="markers",text=train_data["outage_duration"]),
        go.Scatter(x=train_data.loc[train_data.outage_duration==1,'area_code'],
                 y=train_data.loc[train_data.outage_duration==1,'id'],
                 mode="markers",text=train_data["outage_duration"]),
       go.Scatter(x=train_data.loc[train_data.outage_duration==2,'area_code'],
                 y=train_data.loc[train_data.outage_duration==2,'id'],
                 mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

Checking the levels of outage_duration

  • As we can see, there are three levels of outages:
  • 0 for no outage: trace 0 -> blue circles
  • 1 for short outages of up to 2 hours: trace 1 -> red circles
  • 2 for long outages of 2 hours or more: trace 2 -> green circles

Plotting the data by ID and outage duration shows that outages are evenly distributed: no specific set of user IDs is facing the outages; users across the board are affected

Checking the Plot with respect to Outage Duration 0, In Area Vs ID

In [11]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==0,'area_code'],
                 y=train_data.loc[train_data.outage_duration==0,'id'],
                 mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code with Outage Duration : 0",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

From the plot we see that most ID/area combinations are not facing much of a problem

Checking the Plot with respect to Outage Duration 1, In Area Vs ID

In [12]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==1,'area_code'],
                 y=train_data.loc[train_data.outage_duration==1,'id'],
                 mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code Outage Duration : 1",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

Checking the Plot with respect to Outage Duration 2, In Area Vs ID

In [13]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=train_data.loc[train_data.outage_duration==2,'area_code'],
                 y=train_data.loc[train_data.outage_duration==2,'id'],
                 mode="markers",text=train_data["outage_duration"])]
layout=go.Layout(title="ID vs Area Code Outage Duration : 2",xaxis={"title":"area_code"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

From the plot we can see that a few areas have far more long outages than the others

Maximum Outage Duration With Respect to Area

In [14]:
train_data_outage = train_data.loc[train_data.outage_duration==2,'area_code'].value_counts()
In [15]:
train_data_outage
Out[15]:
area_1100    28
area_600     27
area_1107    27
area_821     24
area_734     23
             ..
area_867      1
area_974      1
area_989      1
area_937      1
area_826      1
Name: area_code, Length: 135, dtype: int64
In [16]:
train_data['outage_duration'].value_counts()
Out[16]:
0    3827
1    1496
2     581
Name: outage_duration, dtype: int64
In [17]:
#count plot for Outage Duration 

plt.figure(figsize = (10,4))
sns.countplot(x='outage_duration', data=train_data)
plt.tight_layout()
plt.show()

Outage for the users with percentage representations.

In [18]:
val=list(train_data['outage_duration'].value_counts())
for i in range(len(val)):
    print("Outage type",train_data['outage_duration'].value_counts().index[i],round(val[i]/sum(val)*100),'%')
Outage type 0 65 %
Outage type 1 25 %
Outage type 2 10 %

As seen in the percentages of the outage durations above, the count plot gives a visual representation of the same: the number of records for each outage duration present in the data

Outage Duration

Outage duration 0 accounts for 65% of the records, i.e. no outage for those users;
users with outage durations 1 and 2 are the ones facing outage breaks
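Given the 65/25/10 class imbalance, class weighting is one common counter-measure when training the classifier. A hedged sketch of the "balanced" weighting heuristic (the same formula scikit-learn's `compute_class_weight` uses), on a toy target reproducing the split observed above:

```python
import numpy as np

# Toy target reproducing the observed 65% / 25% / 10% split
y = np.array([0] * 65 + [1] * 25 + [2] * 10)

classes, counts = np.unique(y, return_counts=True)
# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))  # {0: 0.51, 1: 1.33, 2: 3.33}
```

The rare long-outage class 2 gets the largest weight, pushing the model not to ignore it.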

                 ---------------------------------------------------------------------

Outage_Data

outage_data.csv: For each of the event ids in train_data.csv and test_data.csv (plus some additional ids), records the outage_type. There are 5 different outage_types recorded in the dataset.

Reading the csv File
In [21]:
outage_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/outage_data.csv')
outage_data.head()
Out[21]:
id outage_type
0 6597 outage_type_2
1 8011 outage_type_2
2 2597 outage_type_2
3 5022 outage_type_1
4 6852 outage_type_1
In [22]:
outage_data.shape
Out[22]:
(18552, 2)
There are two columns and 18552 rows:
  • ID : 18552
  • Outage type : 5 different outage types
  • shape : (18552, 2)

Plotting the number of counts of Outage Type

In [23]:
#count plot for Outage Duration

plt.figure(figsize = (14,6))
sns.countplot(outage_data['outage_type'])
plt.tight_layout()
plt.show()

There are 5 different outage types, from outage_type_1 to outage_type_5.

  • Outage type 2 and outage type 1 are the most frequent in the data
In [24]:
outage_data['outage_type'].value_counts()
Out[24]:
outage_type_2    8737
outage_type_1    8728
outage_type_4    1014
outage_type_5      65
outage_type_3       8
Name: outage_type, dtype: int64

Here we can see the value counts of each outage type

Let's look into the percentage of Outage type in the data
In [25]:
val=list(outage_data['outage_type'].value_counts())
for i in range(len(val)):
    print(outage_data['outage_type'].value_counts().index[i],round(val[i]/sum(val)*100),'%')
outage_type_2 47 %
outage_type_1 47 %
outage_type_4 5 %
outage_type_5 0 %
outage_type_3 0 %

Removing the text

In [26]:
outage_data.outage_type=outage_data.outage_type.apply(lambda x: int(x.split("_")[2]))
In [27]:
outage_data.head()
Out[27]:
id outage_type
0 6597 2
1 8011 2
2 2597 2
3 5022 1
4 6852 1
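The `apply`/`split` pattern used above can also be written with pandas' vectorized string methods, which avoids relying on a fixed split position. A sketch on toy values shaped like the real column:

```python
import pandas as pd

# Toy values shaped like the outage_type column
s = pd.Series(["outage_type_2", "outage_type_1", "outage_type_5"])

# Extract the trailing digits and cast to int
codes = s.str.extract(r"(\d+)$", expand=False).astype(int)
print(codes.tolist())
```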

Dummification of the Outage type

In [28]:
outage_type_dummies = pd.get_dummies(outage_data,columns=['outage_type'])
In [29]:
outage_type_dummies.head()
Out[29]:
id outage_type_1 outage_type_2 outage_type_3 outage_type_4 outage_type_5
0 6597 0 1 0 0 0
1 8011 0 1 0 0 0
2 2597 0 1 0 0 0
3 5022 1 0 0 0 0
4 6852 1 0 0 0 0
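For reference, `pd.get_dummies` one-hot encodes only the columns named in `columns=` and passes the rest through unchanged. A self-contained sketch on a toy frame mirroring outage_data:

```python
import pandas as pd

# Toy frame mirroring outage_data after the text was stripped
toy = pd.DataFrame({"id": [6597, 8011, 5022],
                    "outage_type": [2, 2, 1]})

# 'id' is untouched; 'outage_type' becomes one indicator column per level
dummies = pd.get_dummies(toy, columns=["outage_type"])
print(dummies.columns.tolist())
```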
In [30]:
outage_data.describe(include='all')
Out[30]:
id outage_type
count 18552.000000 18552.000000
mean 9276.500000 1.649795
std 5355.645433 0.773591
min 1.000000 1.000000
25% 4638.750000 1.000000
50% 9276.500000 2.000000
75% 13914.250000 2.000000
max 18552.000000 5.000000
From the above description:
  • total count is 18552
  • Outage type 2 is repeated the most
  • There are 5 levels of outage type
  • The top frequency is 8737, i.e., the number of times outage_type_2 appears in the records

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Server_Data

server_data.csv: For each of the event ids mentioned in the train_data.csv and test_data.csv files and also some additional ids there is a record of the transit_server_type that is stored in the dataset. Transit Servers handle the requests and responses of the customers.

Looking into the first 5 rows of the data
In [31]:
server_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/server_data.csv')
server_data.head()
Out[31]:
id transit_server_type
0 6597 transit_server_type_11
1 8011 transit_server_type_15
2 2597 transit_server_type_15
3 5022 transit_server_type_15
4 5022 transit_server_type_11
In [32]:
server_data.shape
Out[32]:
(31170, 2)
In [33]:
server_data.nunique().value_counts()
Out[33]:
53       1
18552    1
dtype: int64

From the Server Data:

  • there are two columns in the dataframe
  • ID : 31170 rows, of which 18552 are unique
  • Server type : there are 53 distinct transit server types in the given data set
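Since ids repeat in server_data (31170 rows against 18552 unique ids), a natural follow-up is counting how many distinct servers each id uses, e.g. with a groupby. A sketch on toy rows shaped like the data:

```python
import pandas as pd

# Toy stand-in: one row per (id, server) pair, ids may repeat
srv = pd.DataFrame({"id": [5022, 5022, 6597, 6852, 6852],
                    "transit_server_type": [15, 11, 11, 11, 15]})

# Number of distinct transit servers recorded against each id
servers_per_id = srv.groupby("id")["transit_server_type"].nunique()
print(servers_per_id.to_dict())
```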

Checking the Value count of the server type.

In [34]:
server_count = server_data['transit_server_type'].value_counts()
server_count
Out[34]:
transit_server_type_11    7888
transit_server_type_35    6615
transit_server_type_34    5927
transit_server_type_15    4395
transit_server_type_20    1458
transit_server_type_54     684
transit_server_type_13     582
transit_server_type_42     478
transit_server_type_44     466
transit_server_type_23     429
transit_server_type_14     330
transit_server_type_43     306
transit_server_type_22     223
transit_server_type_50     154
transit_server_type_10     145
transit_server_type_21     136
transit_server_type_18      73
transit_server_type_47      69
transit_server_type_26      65
transit_server_type_32      63
transit_server_type_30      60
transit_server_type_45      53
transit_server_type_24      46
transit_server_type_27      44
transit_server_type_29      42
transit_server_type_40      40
transit_server_type_46      38
transit_server_type_2       37
transit_server_type_28      32
transit_server_type_8       29
transit_server_type_6       28
transit_server_type_5       26
transit_server_type_7       24
transit_server_type_3       19
transit_server_type_38      19
transit_server_type_39      18
transit_server_type_36      18
transit_server_type_53      17
transit_server_type_49      17
transit_server_type_19      14
transit_server_type_9       14
transit_server_type_37      10
transit_server_type_31      10
transit_server_type_12       6
transit_server_type_25       5
transit_server_type_51       4
transit_server_type_1        4
transit_server_type_17       3
transit_server_type_48       2
transit_server_type_41       2
transit_server_type_52       1
transit_server_type_4        1
transit_server_type_33       1
Name: transit_server_type, dtype: int64

Checking the value counts, percentage-wise

In [35]:
val=list(server_data['transit_server_type'].value_counts())
for i in range(len(val)):
    print(server_data['transit_server_type'].value_counts().index[i],round(val[i]/sum(val)*100),'%')
transit_server_type_11 25 %
transit_server_type_35 21 %
transit_server_type_34 19 %
transit_server_type_15 14 %
transit_server_type_20 5 %
transit_server_type_54 2 %
transit_server_type_13 2 %
transit_server_type_42 2 %
transit_server_type_44 1 %
transit_server_type_23 1 %
transit_server_type_14 1 %
transit_server_type_43 1 %
transit_server_type_22 1 %
(… each of the remaining 40 transit server types rounds to 0 %)

Plotting with ID vs Transit Server Type

In [36]:
# x and y given as array_like objects
import plotly.express as px
fig = px.scatter(x=server_data['id'], y=server_data['transit_server_type'])
fig.show()
ID on the X-axis and transit server type on the Y-axis.
  • From the scatter plot, a few servers account for most of the IDs
  • i.e., many users are concentrated on a small set of servers.

As we can see from the above plot:

  • Server 11 has the most users: 25% of records in the given data, i.e., it appears 7888 times
  • Server 35 has the next highest number of users.

Top five server types used the most in the given data.

In [38]:
server_count.head()
Out[38]:
transit_server_type_11    7888
transit_server_type_35    6615
transit_server_type_34    5927
transit_server_type_15    4395
transit_server_type_20    1458
Name: transit_server_type, dtype: int64
In [39]:
server_data.head()
Out[39]:
id transit_server_type
0 6597 transit_server_type_11
1 8011 transit_server_type_15
2 2597 transit_server_type_15
3 5022 transit_server_type_15
4 5022 transit_server_type_11

Let's Remove the text from the transit server type.

In [40]:
server_data.transit_server_type=server_data.transit_server_type.apply(lambda x: int(x.split("_")[3]))
After removing the text from the column transit_server_type, the top 10 rows
In [684]:
server_data.head(10)
Out[684]:
id transit_server_type
0 6597 11
1 8011 15
2 2597 15
3 5022 15
4 5022 11
5 6852 11
6 6852 15
7 5611 15
8 14838 15
9 14838 11

Plotting Server type Count

In [41]:
#count plot for transit server type

plt.figure(figsize = (14,6))
sns.countplot(server_data['transit_server_type'])
plt.tight_layout()
plt.show()

Exploring the Value Counts of the Server Data

In [42]:
server_data['transit_server_type'].value_counts().sum()
Out[42]:
31170
In [43]:
server_data.shape
Out[43]:
(31170, 2)

Dummification of the server type as we know there are 53 server types.

In [44]:
server_type_dummies = pd.get_dummies(server_data,columns=['transit_server_type'])
In [45]:
server_type_dummies.head()
Out[45]:
id transit_server_type_1 transit_server_type_2 transit_server_type_3 transit_server_type_4 transit_server_type_5 transit_server_type_6 transit_server_type_7 transit_server_type_8 transit_server_type_9 ... transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_48 transit_server_type_49 transit_server_type_50 transit_server_type_51 transit_server_type_52 transit_server_type_53 transit_server_type_54
0 6597 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 8011 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 2597 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 5022 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 5022 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 54 columns

In [46]:
server_type_dummies.shape
Out[46]:
(31170, 54)
After dummification we get 53 new columns; together with ID that is 54 columns across 31170 rows
In [47]:
server_data.describe(include='all')
Out[47]:
id transit_server_type
count 31170.000000 31170.000000
mean 9267.735258 24.740680
std 5363.817181 12.152183
min 1.000000 1.000000
25% 4625.250000 11.000000
50% 9287.500000 23.000000
75% 13915.000000 35.000000
max 18552.000000 54.000000
In [176]:
server_data.shape
Out[176]:
(31170, 2)
In [180]:
server_type_dummies.describe()
Out[180]:
id transit_server_type_1 transit_server_type_2 transit_server_type_3 transit_server_type_4 transit_server_type_5 transit_server_type_6 transit_server_type_7 transit_server_type_8 transit_server_type_9 ... transit_server_type_45 transit_server_type_46 transit_server_type_47 transit_server_type_48 transit_server_type_49 transit_server_type_50 transit_server_type_51 transit_server_type_52 transit_server_type_53 transit_server_type_54
count 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 ... 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000 31170.000000
mean 9267.735258 0.000128 0.001187 0.000610 0.000032 0.000834 0.000898 0.000770 0.000930 0.000449 ... 0.001700 0.001219 0.002214 0.000064 0.000545 0.004941 0.000128 0.000032 0.000545 0.021944
std 5363.817181 0.011328 0.034434 0.024682 0.005664 0.028870 0.029959 0.027738 0.030488 0.021189 ... 0.041201 0.034895 0.046998 0.008010 0.023348 0.070117 0.011328 0.005664 0.023348 0.146504
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 4625.250000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 9287.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 13915.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 18552.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 54 columns

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Report Data

report_data.csv: For each event id there are log_report_type and volume columns are recorded. log_report_type is a type of the recorded report generated by a technical team member after evaluating the outage. volume is the volume of data handled in the area at the time of report in custom company specific units.

Reading the Report Data

In [48]:
report_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/report_data.csv')

Taking a look at top five rows from the report data

In [49]:
report_data.head()
Out[49]:
id log_report_type volume
0 6597 log_report_type_68 6
1 8011 log_report_type_68 7
2 2597 log_report_type_68 1
3 5022 log_report_type_172 2
4 5022 log_report_type_56 1
In [50]:
report_data.shape
Out[50]:
(58671, 3)
In [51]:
report_data.nunique()
Out[51]:
id                 18552
log_report_type      386
volume               341
dtype: int64

Report Data: there are three columns

  • ID : 58671 rows
  • log_report_type : 386 distinct types of reports generated by the technical team after evaluating outages
  • volume : 341 distinct values; the volume of data handled in the area at the time of the report, in custom company-specific units

Value Count for Log Report Type

In [52]:
log_report_ = report_data['log_report_type'].value_counts()
In [53]:
log_report_.head(10)
Out[53]:
log_report_type_312    5267
log_report_type_232    4754
log_report_type_82     3472
log_report_type_203    2823
log_report_type_313    2145
log_report_type_233    1901
log_report_type_307    1597
log_report_type_54     1573
log_report_type_170    1526
log_report_type_71     1514
Name: log_report_type, dtype: int64

Above are the top 10 report types issued by the customers, the ones appearing most frequently in the given data

Let's See the Percentage for the Log Report Type

In [54]:
val=list(report_data['log_report_type'].value_counts())
for i in range(len(val)):
    print(report_data['log_report_type'].value_counts().index[i],round(val[i]/sum(val)*100),'%')
log_report_type_312 9 %
log_report_type_232 8 %
log_report_type_82 6 %
log_report_type_203 5 %
log_report_type_313 4 %
log_report_type_233 3 %
log_report_type_307 3 %
log_report_type_54 3 %
log_report_type_170 3 %
log_report_type_71 3 %
log_report_type_315 3 %
log_report_type_134 2 %
log_report_type_80 2 %
log_report_type_235 2 %
log_report_type_193 2 %
log_report_type_219 2 %
log_report_type_68 2 %
log_report_type_227 2 %
log_report_type_314 2 %
log_report_type_201 2 %
log_report_type_234 2 %
log_report_type_73 1 %
log_report_type_195 1 %
log_report_type_301 1 %
log_report_type_309 1 %
log_report_type_55 1 %
log_report_type_229 1 %
log_report_type_273 1 %
log_report_type_308 1 %
log_report_type_368 1 %
log_report_type_376 1 %
log_report_type_171 1 %
log_report_type_228 1 %
log_report_type_283 1 %
log_report_type_306 1 %
log_report_type_291 1 %
log_report_type_310 1 %
log_report_type_230 1 %
log_report_type_70 1 %
log_report_type_345 1 %
log_report_type_81 1 %
log_report_type_191 1 %
log_report_type_375 1 %
(… each of the remaining 343 log report types rounds to 0 %)

Plot for Log Report Type

In [55]:
#count plot for Log Report Type

plt.figure(figsize = (14,6))
sns.countplot(report_data['log_report_type'])
plt.tight_layout()
plt.show()

From the above plot, with the text still in the feature the labels are too crowded to read clearly

Removing the text from the log_report_type column and keeping only the numeric part

In [56]:
report_data.log_report_type=report_data.log_report_type.apply(lambda x: int(x.split("_")[3]))
In [57]:
report_data.head()
Out[57]:
id log_report_type volume
0 6597 68 6
1 8011 68 7
2 2597 68 1
3 5022 172 2
4 5022 56 1
In [58]:
#count plot for Log Report Type

plt.figure(figsize = (14,6))
sns.countplot(report_data['log_report_type'])
plt.tight_layout()
plt.show()

The above plot summarises how frequently the log report types occur in the customer complaint data

Top log report types received by the customer service representatives

In [59]:
log_report_.head(10)
Out[59]:
log_report_type_312    5267
log_report_type_232    4754
log_report_type_82     3472
log_report_type_203    2823
log_report_type_313    2145
log_report_type_233    1901
log_report_type_307    1597
log_report_type_54     1573
log_report_type_170    1526
log_report_type_71     1514
Name: log_report_type, dtype: int64

Resetting the index and giving it a new name

In [60]:
report_data.reset_index(inplace=True)
report_data.rename(columns={'index':'count_of_log_report_seen'},inplace=True)
The new feature 'count_of_log_report_seen' lets us map back to the original row order after creating new features, and check that each created feature is matched correctly.
In [61]:
report_data.head()
Out[61]:
count_of_log_report_seen id log_report_type volume
0 0 6597 68 6
1 1 8011 68 7
2 2 2597 68 1
3 3 5022 172 2
4 4 5022 56 1

Creating a new feature: the number of IDs sharing each log report type

In [62]:
report_data_value_counts = report_data.log_report_type.value_counts().to_dict()
report_data['num_ids_with_log_feature'] = report_data['log_report_type'].map(lambda x: report_data_value_counts[x])
Creating the new feature by mapping each log report type to its frequency using a dictionary
In [63]:
report_data.head()
Out[63]:
count_of_log_report_seen id log_report_type volume num_ids_with_log_feature
0 0 6597 68 6 1093
1 1 8011 68 7 1093
2 2 2597 68 1 1093
3 3 5022 172 2 214
4 4 5022 56 1 244
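The frequency-encoding step above (mapping each log report type to its count) can be checked on toy data; `Series.map` accepts the dictionary produced by `value_counts().to_dict()`:

```python
import pandas as pd

# Toy stand-in for the log_report_type column
toy = pd.DataFrame({"log_report_type": [68, 68, 68, 172, 56]})

# Frequency encoding: replace each type with how often it occurs
freq = toy["log_report_type"].value_counts().to_dict()
toy["num_ids_with_log_feature"] = toy["log_report_type"].map(freq)
print(toy["num_ids_with_log_feature"].tolist())
```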

Manually binning the log report types

In [64]:
bins = [-10,0,10,20,30,40,50,60,70,80,90,100,110,120,130,140,150,160,170,180,190,
        200,210,220,230,240,250,260,270,280,290,300,310,320,330,340,350,360,
        370,380,390,400]
report_data['binned_log_report'] = np.digitize(report_data['log_report_type'], bins, right=True)
bins_offset = list(map(lambda x:x+5, bins))
report_data['binned_offset_log_report'] = np.digitize(report_data['log_report_type'], bins_offset, right=True)
As log_report_type has many levels, we bin it manually to reduce the complexity.
  • Created the new feature 'binned_log_report'
  • np.digitize returns the indices of the bins to which each value in the input array belongs

  • A second feature, 'binned_offset_log_report', uses the same bins shifted by 5

  • The offset of 5 (half a bin width) makes types near a bin boundary fall into different bins across the two features
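np.digitize behaves as described above; a minimal runnable check with a shortened bin list (the real notebook uses bins up to 400):

```python
import numpy as np

# Shortened version of the bin edges used above
bins = [-10, 0, 10, 20, 30, 40]

# right=True places a value x in bin i when bins[i-1] < x <= bins[i]
vals = np.array([5, 10, 11, 35])
binned = np.digitize(vals, bins, right=True)
print(binned.tolist())
```

With the full bin list, log_report_type 68 satisfies 60 < 68 <= 70 and lands in bin 8, matching the binned_log_report values shown below.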
In [65]:
report_data.head()
Out[65]:
count_of_log_report_seen id log_report_type volume num_ids_with_log_feature binned_log_report binned_offset_log_report
0 0 6597 68 6 1093 8 8
1 1 8011 68 7 1093 8 8
2 2 2597 68 1 1093 8 8
3 3 5022 172 2 214 19 18
4 4 5022 56 1 244 7 7

Above we can see a slight shift between the two binned features; we also want the position of each log report within its id,

  • We'll create a column for the position of log reports and take a group-wise cumulative sum

Creating Feature for lag position

In [66]:
report_data['position_of_log_report'] = 1
report_data['position_of_log_report'] = report_data.groupby(['id'])['position_of_log_report'].cumsum()
report_data['log_report_type'] = report_data['log_report_type'].astype(int)
  • Initially assigning position_of_log_report to 1
  • Grouping by ID and taking the cumulative sum (a running partial sum) gives each row its position within its id group
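The group-wise cumulative sum can be verified on a toy frame; each row gets its running position (1, 2, 3, ...) within its id group. `groupby(...).cumcount() + 1` would be an equivalent one-liner:

```python
import pandas as pd

# Toy ids in the same row order as they would appear in report_data
toy = pd.DataFrame({"id": [6597, 5022, 5022, 5022, 8011]})

# Seed with 1 and take a per-id running sum, as in the notebook
toy["position_of_log_report"] = 1
toy["position_of_log_report"] = toy.groupby("id")["position_of_log_report"].cumsum()
print(toy["position_of_log_report"].tolist())
```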
In [67]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=report_data['log_report_type'],
                 y=report_data['id'],
                 mode="markers",text=report_data["log_report_type"])]
layout=go.Layout(title="ID vs Log Report Type",xaxis={"title":"Log Report Type"},yaxis=dict(title="ID"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

From the plot, we can see how the log report types are distributed across the user IDs

In [68]:
report_data.head()
Out[68]:
count_of_log_report_seen id log_report_type volume num_ids_with_log_feature binned_log_report binned_offset_log_report position_of_log_report
0 0 6597 68 6 1093 8 8 1
1 1 8011 68 7 1093 8 8 1
2 2 2597 68 1 1093 8 8 1
3 3 5022 172 2 214 19 18 1
4 4 5022 56 1 244 7 7 2
In [69]:
report_data.describe(include='all')
Out[69]:
count_of_log_report_seen id log_report_type volume num_ids_with_log_feature binned_log_report binned_offset_log_report position_of_log_report
count 58671.000000 58671.000000 58671.000000 58671.000000 58671.000000 58671.000000 58671.000000 58671.000000
mean 29335.000000 9271.368905 209.092465 9.685296 1873.706175 22.472465 21.756234 2.816809
std 16937.003159 5355.142878 93.066417 27.314433 1680.450516 9.317041 9.318942 2.196520
min 0.000000 1.000000 1.000000 1.000000 1.000000 2.000000 1.000000 1.000000
25% 14667.500000 4658.500000 134.000000 1.000000 484.000000 15.000000 14.000000 1.000000
50% 29335.000000 9275.000000 227.000000 2.000000 1419.000000 24.000000 24.000000 2.000000
75% 44002.500000 13903.000000 307.000000 7.000000 2823.000000 32.000000 32.000000 4.000000
max 58670.000000 18552.000000 386.000000 1310.000000 5267.000000 40.000000 40.000000 20.000000
In [177]:
report_data.shape
Out[177]:
(58671, 8)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Broadband Data

broadband_data.csv: For each of the event ids mentioned in the train_data.csv and test_data.csv files and also some additional ids there is a record of the broadband_type that is stored in the dataset. There are 10 different types of broadbands that are observed in the dataset

Reading the Broadband Data

In [70]:
broadband_data = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/broadband_data.csv')
In [71]:
broadband_data.head()
Out[71]:
id broadband_type
0 6597 broadband_type_8
1 8011 broadband_type_8
2 2597 broadband_type_8
3 5022 broadband_type_8
4 6852 broadband_type_8
In [72]:
broadband_data.shape
Out[72]:
(21076, 2)
In [73]:
broadband_data.nunique()
Out[73]:
id                18552
broadband_type       10
dtype: int64

BroadBand Data:

  • ID : 21076 rows, Unique ID : 18552
  • Broadband type : 10 broadband types available

Value Count for Broadband Data

In [74]:
broadband_data['broadband_type'].value_counts()
Out[74]:
broadband_type_8     10268
broadband_type_2      8918
broadband_type_6       582
broadband_type_7       498
broadband_type_4       330
broadband_type_9       190
broadband_type_3       145
broadband_type_10       73
broadband_type_1        58
broadband_type_5        14
Name: broadband_type, dtype: int64

Broadband types description

  • broadband_type_8 : 'ADSL 1'
  • broadband_type_2 : 'ADSL 2'
  • broadband_type_6 : 'ADSL 2+'
  • broadband_type_7 : 'Cable'
  • broadband_type_4 : 'Fiber 1'
  • broadband_type_9 : 'BPL'
  • broadband_type_3 : 'Fiber 2'
  • broadband_type_10 : 'Fiber High Speed'
  • broadband_type_1 : 'Fiber Ultra'
  • broadband_type_5 : 'Fiber Ultra Max'

From the data, broadband_type_8 ('ADSL 1') is used the most, followed by broadband_type_2, and so on.
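The description above can be kept as a lookup dictionary, handy for labelling plots once the numeric codes are extracted later in the notebook. The names below are taken from the list above; the codes in the toy series are illustrative:

```python
import pandas as pd

# Code-to-name lookup, taken from the broadband type description above
broadband_names = {8: "ADSL 1", 2: "ADSL 2", 6: "ADSL 2+", 7: "Cable",
                   4: "Fiber 1", 9: "BPL", 3: "Fiber 2",
                   10: "Fiber High Speed", 1: "Fiber Ultra", 5: "Fiber Ultra Max"}

# Toy numeric codes, as they look after stripping the 'broadband_type_' prefix
codes = pd.Series([8, 8, 2, 5])
named = codes.map(broadband_names)
print(named.tolist())
```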

Percentage of Broadband types through Value counts in the Data

In [75]:
val=list(broadband_data['broadband_type'].value_counts())
for i in range(len(val)):
    print(broadband_data['broadband_type'].value_counts().index[i],round(val[i]/sum(val)*100),'%')
broadband_type_8 49 %
broadband_type_2 42 %
broadband_type_6 3 %
broadband_type_7 2 %
broadband_type_4 2 %
broadband_type_9 1 %
broadband_type_3 1 %
broadband_type_10 0 %
broadband_type_1 0 %
broadband_type_5 0 %

Plot for Broadband types

In [76]:
#count plot for broadband type

plt.figure(figsize = (14,6))
sns.countplot(broadband_data['broadband_type'])
plt.tight_layout()
plt.show()

From the plot we can see the broadband usage by the users in the given data.

Removing the text from the broadband_type column and keeping only the numbers.

In [77]:
broadband_data.broadband_type=broadband_data.broadband_type.apply(lambda x: int(x.split("_")[2]))
In [78]:
broadband_data
Out[78]:
id broadband_type
0 6597 8
1 8011 8
2 2597 8
3 5022 8
4 6852 8
... ... ...
21071 3761 8
21072 8720 8
21073 6488 8
21074 878 8
21075 4464 8

21076 rows × 2 columns
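The `apply`/`lambda` conversion above can also be done with the vectorised `.str` accessor; a small sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical mini-frame mirroring the original string labels.
df = pd.DataFrame({"broadband_type": ["broadband_type_8", "broadband_type_2"]})

# Take the token after the last underscore and cast it to int.
df["broadband_type"] = df["broadband_type"].str.split("_").str[-1].astype(int)
print(df["broadband_type"].tolist())  # [8, 2]
```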

In [79]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=broadband_data.id,
                 y=broadband_data.broadband_type,
                 mode="markers",text=broadband_data["broadband_type"])]
layout=go.Layout(title="ID vs Broadband Type",xaxis={"title":"ID"},yaxis=dict(title="Broadband Type"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()
In [80]:
broadband_data.head()
Out[80]:
id broadband_type
0 6597 8
1 8011 8
2 2597 8
3 5022 8
4 6852 8

Dummification of Broadband type

In [81]:
broadband_type_dummies = pd.get_dummies(broadband_data,columns=['broadband_type'])
In [82]:
broadband_type_dummies
Out[82]:
id broadband_type_1 broadband_type_2 broadband_type_3 broadband_type_4 broadband_type_5 broadband_type_6 broadband_type_7 broadband_type_8 broadband_type_9 broadband_type_10
0 6597 0 0 0 0 0 0 0 1 0 0
1 8011 0 0 0 0 0 0 0 1 0 0
2 2597 0 0 0 0 0 0 0 1 0 0
3 5022 0 0 0 0 0 0 0 1 0 0
4 6852 0 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ...
21071 3761 0 0 0 0 0 0 0 1 0 0
21072 8720 0 0 0 0 0 0 0 1 0 0
21073 6488 0 0 0 0 0 0 0 1 0 0
21074 878 0 0 0 0 0 0 0 1 0 0
21075 4464 0 0 0 0 0 0 0 1 0 0

21076 rows × 11 columns

After dummifying the broadband type we get a frame of shape (21076 × 11): one indicator column per broadband type, plus the id.
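`pd.get_dummies` keeps the untouched columns and adds one indicator column per observed category, which is why the 2-column frame becomes 11 columns here. A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2, 3], "broadband_type": [8, 2, 8]})
dummies = pd.get_dummies(toy, columns=["broadband_type"])

# 'id' passes through; one 0/1 column per category value (2 and 8).
print(dummies.columns.tolist())  # ['id', 'broadband_type_2', 'broadband_type_8']
```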

In [83]:
val=list(broadband_data['broadband_type'].value_counts())
for i in range(len(val)):
    print(broadband_data['broadband_type'].value_counts().index[i],round(val[i]/sum(val)*100),'%')
8 49 %
2 42 %
6 3 %
7 2 %
4 2 %
9 1 %
3 1 %
10 0 %
1 0 %
5 0 %

As we can see, broadband type 8 (ADSL 1) is the most frequent type in the data.

  • After examining the distribution, we add a lagged copy of the broadband type (shifted by one row) as a new column

Creating new feature Broadband Shifted up

  • After dummification, broadband type 8 accounts for 49% of rows and broadband type 2 for 42%
  • To capture this repetition, we shift the column down by one row
In [84]:
broadband_data['broadband_shifted_up'] = broadband_data['broadband_type'].shift(1)
In [85]:
broadband_data['broadband_shifted_up'] = broadband_data['broadband_shifted_up'].astype('float64', errors='ignore')
In [86]:
broadband_data.head()
Out[86]:
id broadband_type broadband_shifted_up
0 6597 8 NaN
1 8011 8 8.0
2 2597 8 8.0
3 5022 8 8.0
4 6852 8 8.0

The shift leaves a NaN in the first row, which has no predecessor

  • we need to fill this NaN
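`shift(1)` moves every value down one row, so the first row, which has no predecessor, becomes NaN; filling it with the column's most frequent value is one simple choice. A toy sketch:

```python
import pandas as pd

s = pd.Series([8, 2, 8, 8])
shifted = s.shift(1)                   # [NaN, 8.0, 2.0, 8.0]
shifted = shifted.fillna(s.mode()[0])  # fill the gap with the mode (8 here)
print(shifted.tolist())  # [8.0, 8.0, 2.0, 8.0]
```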
In [87]:
broadband_data.dtypes
Out[87]:
id                        int64
broadband_type            int64
broadband_shifted_up    float64
dtype: object
In [88]:
broadband_data['broadband_type']=broadband_data['broadband_type'].astype('int64')
In [89]:
broadband_data.isnull().sum()
Out[89]:
id                      0
broadband_type          0
broadband_shifted_up    1
dtype: int64

The shifted feature has one NaN, so we fill it with broadband type 8, the most frequent value.

In [90]:
broadband_data['broadband_shifted_up'].fillna(8,inplace=True)
In [91]:
broadband_data.head()
Out[91]:
id broadband_type broadband_shifted_up
0 6597 8 8.0
1 8011 8 8.0
2 2597 8 8.0
3 5022 8 8.0
4 6852 8 8.0
In [92]:
broadband_data.describe()
Out[92]:
id broadband_type broadband_shifted_up
count 21076.000000 21076.000000 21076.000000
mean 9255.869330 5.279987 5.279987
std 5366.730222 2.924911 2.924911
min 1.000000 1.000000 1.000000
25% 4599.750000 2.000000 2.000000
50% 9256.500000 7.000000 7.000000
75% 13907.250000 8.000000 8.000000
max 18552.000000 10.000000 10.000000

Let's check whether each broadband type repeats the previous one or differs from it.

  • We create two new indicator features: one for repeats and one for changes
In [93]:
broadband_data['is_next_broadband_type_repeat'] = 0
broadband_data['is_next_broadband_type_different'] = 0
In [94]:
broadband_data.head()
Out[94]:
id broadband_type broadband_shifted_up is_next_broadband_type_repeat is_next_broadband_type_different
0 6597 8 8.0 0 0
1 8011 8 8.0 0 0
2 2597 8 8.0 0 0
3 5022 8 8.0 0 0
4 6852 8 8.0 0 0

Initially we have assigned both flags to 0

Checking whether the original broadband type matches the shifted broadband type
In [95]:
repeat_cond = broadband_data['broadband_shifted_up'] == broadband_data['broadband_type']
In [96]:
repeat_cond
Out[96]:
0        True
1        True
2        True
3        True
4        True
         ... 
21071    True
21072    True
21073    True
21074    True
21075    True
Length: 21076, dtype: bool

Creating New Feature is next broadband type repeated

In [97]:
broadband_data.loc[repeat_cond, 'is_next_broadband_type_repeat'] = 1
In [98]:
broadband_data.head()
Out[98]:
id broadband_type broadband_shifted_up is_next_broadband_type_repeat is_next_broadband_type_different
0 6597 8 8.0 1 0
1 8011 8 8.0 1 0
2 2597 8 8.0 1 0
3 5022 8 8.0 1 0
4 6852 8 8.0 1 0
In [99]:
broadband_data.loc[~repeat_cond, 'is_next_broadband_type_repeat'] = 1  # ~ inverts the boolean mask
In [100]:
broadband_data.head()
Out[100]:
id broadband_type broadband_shifted_up is_next_broadband_type_repeat is_next_broadband_type_different
0 6597 8 8.0 1 0
1 8011 8 8.0 1 0
2 2597 8 8.0 1 0
3 5022 8 8.0 1 0
4 6852 8 8.0 1 0

Creating new feature switches broadband type

  • cumsum() returns the cumulative sum of the elements, here a running count of type switches
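On a 0/1 "type changed" flag, `cumsum()` produces a running count of switches (in this notebook the flag column happens to be all zeros, so the cumulative sum stays zero). A toy sketch of the idea:

```python
import pandas as pd

s = pd.Series([8, 8, 2, 2, 6])
changed = (s != s.shift(1)).astype(int)  # 1 where the type differs from the previous row
changed.iloc[0] = 0                      # first row has no predecessor: no switch
switches = changed.cumsum()
print(switches.tolist())  # [0, 0, 1, 1, 2]
```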
In [101]:
broadband_data['switches_broadband_type'] = broadband_data['is_next_broadband_type_different'].cumsum()
In [102]:
broadband_data['switches_broadband_type'].value_counts()
Out[102]:
0    21076
Name: switches_broadband_type, dtype: int64
In [103]:
broadband_data
Out[103]:
id broadband_type broadband_shifted_up is_next_broadband_type_repeat is_next_broadband_type_different switches_broadband_type
0 6597 8 8.0 1 0 0
1 8011 8 8.0 1 0 0
2 2597 8 8.0 1 0 0
3 5022 8 8.0 1 0 0
4 6852 8 8.0 1 0 0
... ... ... ... ... ... ...
21071 3761 8 8.0 1 0 0
21072 8720 8 8.0 1 0 0
21073 6488 8 8.0 1 0 0
21074 878 8 8.0 1 0 0
21075 4464 8 8.0 1 0 0

21076 rows × 6 columns

In [104]:
broadband_data.nunique()
Out[104]:
id                                  18552
broadband_type                         10
broadband_shifted_up                   10
is_next_broadband_type_repeat           1
is_next_broadband_type_different        1
switches_broadband_type                 1
dtype: int64

The three engineered flag columns each have a single unique value, so they carry no information and can be dropped.
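A generic way to locate and drop such constant columns is to test `nunique() == 1`; a sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "flag": [1, 1, 1], "b": [0, 0, 1]})

# Columns with a single unique value carry no information for a model.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols, df.columns.tolist())  # ['flag'] ['a', 'b']
```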

In [105]:
broadband_data.describe()
Out[105]:
id broadband_type broadband_shifted_up is_next_broadband_type_repeat is_next_broadband_type_different switches_broadband_type
count 21076.000000 21076.000000 21076.000000 21076.0 21076.0 21076.0
mean 9255.869330 5.279987 5.279987 1.0 0.0 0.0
std 5366.730222 2.924911 2.924911 0.0 0.0 0.0
min 1.000000 1.000000 1.000000 1.0 0.0 0.0
25% 4599.750000 2.000000 2.000000 1.0 0.0 0.0
50% 9256.500000 7.000000 7.000000 1.0 0.0 0.0
75% 13907.250000 8.000000 8.000000 1.0 0.0 0.0
max 18552.000000 10.000000 10.000000 1.0 0.0 0.0
In [106]:
broadband_data.dtypes
Out[106]:
id                                    int64
broadband_type                        int64
broadband_shifted_up                float64
is_next_broadband_type_repeat         int64
is_next_broadband_type_different      int64
switches_broadband_type               int64
dtype: object
In [178]:
broadband_data.shape
Out[178]:
(21076, 6)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Merging the files

Checking the Unique IDs Across All the Datasets

In [107]:
print("Number of unique ID's in Train Data      : ",train_data.id.nunique())
print("Number of unique ID's in Broadband Data  : ",broadband_data.id.nunique())
print("Number of unique ID's in Server Data     : ",server_data.id.nunique())
print("Number of unique ID's in Outage Data     : ",outage_data.id.nunique())
print("Number of unique ID's in Report Data     : ",report_data.id.nunique()) 
Number of unique ID's in Train Data      :  5904
Number of unique ID's in Broadband Data  :  18552
Number of unique ID's in Server Data     :  18552
Number of unique ID's in Outage Data     :  18552
Number of unique ID's in Report Data     :  18552

Printing the Shape of All the Datasets

In [108]:
print('The shape of train is: {}\n'.format(train_data.shape))
print('The shape of broadband is: {}\n'.format(broadband_data.shape))
print('The shape of server is: {}\n'.format(server_data.shape))
print('The shape of outage is: {}\n'.format(outage_data.shape))
print('The shape of report is: {}\n'.format(report_data.shape))
The shape of train is: (5904, 3)

The shape of broadband is: (21076, 6)

The shape of server is: (31170, 2)

The shape of outage is: (18552, 2)

The shape of report is: (58671, 8)

Merging the Train Data with Broadband Data

In [109]:
train_broadband_combined = pd.merge(train_data,broadband_data,left_on = ['id'],
                               right_on = ['id'],how='left')
In [110]:
train_broadband_combined.shape
Out[110]:
(6754, 8)
In [111]:
train_broadband_combined.head()
Out[111]:
id area_code outage_duration broadband_type broadband_shifted_up is_next_broadband_type_repeat is_next_broadband_type_different switches_broadband_type
0 13366 area_415 1 2 2.0 1 0 0
1 6783 area_474 0 2 2.0 1 0 0
2 9519 area_931 1 8 8.0 1 0 0
3 10202 area_700 1 8 8.0 1 0 0
4 4555 area_600 2 8 8.0 1 0 0
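Note that the left merge grew the 5904-row train set to 6754 rows: when the right-hand key (`id` in broadband_data) is duplicated, each duplicate produces an extra output row. A toy demonstration; the `validate` argument can guard against this:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2]})
right = pd.DataFrame({"id": [1, 1, 2], "val": [10, 11, 20]})

merged = pd.merge(left, right, on="id", how="left")
print(len(merged))  # 3 rows: id 1 matched two right-hand rows

# pd.merge(left, right, on="id", how="left", validate="one_to_one")
# would raise MergeError here, flagging the duplicated key early.
```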

Answering the Questions

● Which areas are most prone to long outage durations?

● Which broadband types are suspect of long outage durations?

● Any other recommendations to improve the detection of outage durations.

To answer these questions:

  • We have merged the train data with the broadband data
In [112]:
train_broadband_outage = train_broadband_combined.loc[train_broadband_combined.outage_duration==2,'broadband_type'].value_counts()
In [113]:
train_broadband_outage
Out[113]:
8     544
2      82
6      26
4      17
1      10
7       9
9       4
3       4
5       3
10      1
Name: broadband_type, dtype: int64

Filtering for outage_duration == 2 (long outages) and counting by broadband type, we see:

Broadband type 8 has the most long outages

Plot for Broadband types

In [114]:
#count plot for broadband type

plt.figure(figsize = (14,6))
sns.countplot(train_broadband_combined['broadband_type'])
plt.tight_layout()
plt.show()

From the above plot, broadband type 8 is the most common in the data.

Creating a long-outage mask by taking outage_duration == 2 alone

In [117]:
long_outage = train_data['outage_duration']==2
long_outage.head()
Out[117]:
0    False
1    False
2    False
3    False
4     True
Name: outage_duration, dtype: bool
In [118]:
long_outage_ = train_data[long_outage]
In [119]:
long_outage_.head()
Out[119]:
id area_code outage_duration
4 4555 area_600 2
16 1373 area_962 2
24 4209 area_1086 2
32 8984 area_976 2
34 14361 area_704 2

Plot with Area code vs Outage Duration

In [120]:
count=long_outage_.area_code.value_counts()
count.head(10).plot(kind="bar",figsize=(10,10))
count.head(10)
Out[120]:
area_1100    28
area_600     27
area_1107    27
area_821     24
area_734     23
area_704     19
area_1019    18
area_810     18
area_684     17
area_798     17
Name: area_code, dtype: int64

From the above graph we can clearly see that area_1100 has the most long outages and is the most prone to them.

Defining a function to remove the text from the area code

In [121]:
def str_to_num(string):
    return int(string.split("_")[1])
In [122]:
train_data["area_code"]=train_data["area_code"].apply(str_to_num)
In [123]:
train_data.head()
Out[123]:
id area_code outage_duration
0 13366 415 1
1 6783 474 0
2 9519 931 1
3 10202 700 1
4 4555 600 2

Merging the Train Data with Outage Data

In [125]:
merge_1 = train_data.merge(outage_data,how="left",left_on=["id"],right_on=["id"])
In [126]:
merge_1.head()
Out[126]:
id area_code outage_duration outage_type
0 13366 415 1 4
1 6783 474 0 2
2 9519 931 1 2
3 10202 700 1 1
4 4555 600 2 2
In [127]:
merge_1.set_index("id",inplace=True)
In [128]:
merge_1.describe()
Out[128]:
area_code outage_duration outage_type
count 5904.000000 5904.000000 5904.000000
mean 623.909045 0.450203 1.656673
std 320.360280 0.666642 0.759276
min 1.000000 0.000000 1.000000
25% 375.000000 0.000000 1.000000
50% 661.000000 0.000000 2.000000
75% 892.000000 1.000000 2.000000
max 1126.000000 2.000000 5.000000

Creating a new feature by grouping on area code and outage duration

In [129]:
merge_1["num"]=merge_1.groupby(['area_code',"outage_duration"]).cumcount()+1
In [130]:
merge_1[(merge_1["area_code"]==1) & (merge_1["outage_duration"]==1)]
Out[130]:
area_code outage_duration outage_type num
id
9664 1 1 2 1
3952 1 1 1 2
15952 1 1 1 3
1235 1 1 1 4
14595 1 1 2 5
5280 1 1 1 6
17041 1 1 1 7

The num column counts the outages of each type within an area, so we can see which kind of outage dominates where and focus on reducing it.
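`groupby(...).cumcount()` numbers the rows within each group starting at 0, so adding 1 gives the running 1-based counter used for `num`. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({"area": [1, 1, 1, 2], "dur": [1, 1, 0, 1]})
df["num"] = df.groupby(["area", "dur"]).cumcount() + 1

# Rows 0 and 1 share the group (area=1, dur=1), so they get num 1 and 2.
print(df["num"].tolist())  # [1, 2, 1, 1]
```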

In [131]:
merge_1.num.value_counts()
Out[131]:
1     1386
2      862
3      586
4      434
5      342
6      285
7      241
8      209
9      183
10     166
11     146
12     127
13     110
14      95
15      91
16      82
17      70
18      58
19      51
20      44
21      40
22      38
23      33
24      28
25      22
26      20
27      19
28      14
29      12
31       9
30       9
34       8
33       8
32       8
35       7
37       6
36       6
39       5
38       5
41       5
42       5
40       5
43       4
45       3
44       3
47       2
46       2
55       1
48       1
52       1
56       1
49       1
53       1
51       1
50       1
54       1
57       1
Name: num, dtype: int64

Plot with respect to Number of Cases registered

In [132]:
data=[go.Histogram(x=merge_1["num"])]
layout=go.Layout(title="Histogram")
layout=go.Layout(title="number of cases",xaxis={"title":"number of cases"},yaxis=dict(title="count of no of cases"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

This shows how common repeated cases are per group: for example, 586 (area, outage duration) groups accumulate at least a third case.

Plot with respect to Number of cases vs Area

In [133]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=merge_1.loc[merge_1.outage_duration==0,'area_code'],
                 y=merge_1.loc[merge_1.outage_duration==0,'num'],
                 mode="markers",text=merge_1["outage_duration"]),
        go.Scatter(x=merge_1.loc[merge_1.outage_duration==1,'area_code'],
                 y=merge_1.loc[merge_1.outage_duration==1,'num'],
                 mode="markers",text=merge_1["outage_duration"]),
       go.Scatter(x=merge_1.loc[merge_1.outage_duration==2,'area_code'],
                 y=merge_1.loc[merge_1.outage_duration==2,'num'],
                 mode="markers",text=merge_1["outage_duration"])]
layout=go.Layout(title="num vs area_code",xaxis={"title":"area_code"},yaxis=dict(title="num"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

This shows, for each area, which kind of outage is most frequent, so the company can target it and thereby reduce customer churn in that area.

Merging Train Data with Server Data

In [135]:
server_data
Out[135]:
id transit_server_type
0 6597 11
1 8011 15
2 2597 15
3 5022 15
4 5022 11
... ... ...
31165 3761 11
31166 8720 11
31167 6488 11
31168 878 11
31169 4464 11

31170 rows × 2 columns

In [136]:
merge_2=server_data.merge(train_data, on='id')
In [137]:
merge_2.head()
Out[137]:
id transit_server_type area_code outage_duration
0 8011 15 1 0
1 2588 15 1 0
2 2588 11 1 0
3 4848 11 1 0
4 4848 15 1 0

Checking for Any Null values

In [567]:
merge_2.isnull().sum()
Out[567]:
id                     0
transit_server_type    0
area_code              0
outage_duration        0
dtype: int64

Checking for Unique Values

In [138]:
merge_2.nunique()
Out[138]:
id                     5904
transit_server_type      49
area_code               876
outage_duration           3
dtype: int64
In [139]:
merge_2.shape
Out[139]:
(9968, 4)

Assigning the Value Counts to a DataFrame

In [140]:
transit_server_type_unq=pd.DataFrame(merge_2.transit_server_type.value_counts())
transit_server_type_unq.head()
Out[140]:
transit_server_type
11 2473
35 2158
34 1925
15 1370
20 437

Creating a New Column with the Percentage of Each Transit Server Type

In [141]:
transit_server_type_unq["percen_trn"]=merge_2.transit_server_type.value_counts(normalize=True)*100
In [142]:
transit_server_type_unq
Out[142]:
transit_server_type percen_trn
11 2473 24.809390
35 2158 21.649278
34 1925 19.311798
15 1370 13.743981
20 437 4.384029
54 203 2.036517
13 201 2.016453
23 159 1.595104
42 148 1.484751
44 142 1.424559
14 114 1.143660
43 99 0.993178
22 75 0.752408
21 50 0.501605
10 44 0.441413
50 37 0.371188
18 28 0.280899
47 24 0.240770
30 23 0.230738
26 22 0.220706
32 21 0.210674
45 18 0.180578
24 17 0.170546
46 16 0.160514
29 14 0.140449
5 13 0.130417
2 13 0.130417
7 13 0.130417
40 12 0.120385
28 12 0.120385
6 12 0.120385
27 11 0.110353
38 8 0.080257
8 7 0.070225
36 7 0.070225
3 6 0.060193
39 5 0.050161
49 5 0.050161
37 4 0.040128
53 4 0.040128
19 3 0.030096
51 3 0.030096
9 3 0.030096
31 3 0.030096
12 2 0.020064
41 1 0.010032
25 1 0.010032
1 1 0.010032
48 1 0.010032
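The two steps above (counts, then a percentage column) can also be built in one pass; a sketch on a toy series (values illustrative):

```python
import pandas as pd

s = pd.Series([11, 11, 35, 15])
summary = s.value_counts().to_frame("count")
summary["percen_trn"] = s.value_counts(normalize=True) * 100  # aligns on the index
print(summary.loc[11, "count"], summary.loc[11, "percen_trn"])  # 2 50.0
```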
In [143]:
merge_2.head()
Out[143]:
id transit_server_type area_code outage_duration
0 8011 15 1 0
1 2588 15 1 0
2 2588 11 1 0
3 4848 11 1 0
4 4848 15 1 0

Creating New Feature Num to Group by Server Type and Area Code

In [144]:
merge_2["num"]=merge_2.groupby(['transit_server_type','area_code']).cumcount()+1
In [145]:
merge_2.head()
Out[145]:
id transit_server_type area_code outage_duration num
0 8011 15 1 0 1
1 2588 15 1 0 2
2 2588 11 1 0 1
3 4848 11 1 0 2
4 4848 15 1 0 3

Plotting Number of Servers vs Area Code

In [147]:
import plotly.offline as pyo
import plotly.graph_objs as go
data = [go.Scatter(x=merge_2.loc[:,'area_code'],
                 y=merge_2.loc[:,'num'],
                 mode="markers")]
layout=go.Layout(title="Number of Servers vs Area Code",xaxis={"title":"area_code"},yaxis=dict(title="number of servers"),hovermode="closest")
fig=go.Figure(data,layout)
fig.show()

The above plot shows how the servers are distributed across areas.

Merging Train Data with Broadband Data

In [148]:
train_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/train_data.csv')
broadband_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/broadband_data.csv')
In [156]:
train_broadband_combined_ = pd.merge(train_data,broadband_data_,left_on = ['id'],
                               right_on = ['id'],how='left')

Creating a Mask for the Maximum Outage Duration (2), i.e. Long Outages

In [157]:
long_outage_2 = train_broadband_combined_['outage_duration']==2
long_outage_2.head()
Out[157]:
0    False
1    False
2    False
3    False
4     True
Name: outage_duration, dtype: bool

Selecting the long-outage rows.

In [158]:
long_outage_2_=train_broadband_combined_[long_outage_2]
In [159]:
long_outage_2_.head()
Out[159]:
id area_code outage_duration broadband_type
4 4555 600 2 broadband_type_8
16 1373 962 2 broadband_type_8
24 4209 1086 2 broadband_type_8
33 8984 976 2 broadband_type_8
35 14361 704 2 broadband_type_8

Plotting the Broadband with Max Outage Duration

In [161]:
broad_band=long_outage_2_.broadband_type.value_counts()
broad_band.head(10).plot(kind="bar",figsize=(10,10))
broad_band.head(10)
Out[161]:
broadband_type_8     544
broadband_type_2      82
broadband_type_6      26
broadband_type_4      17
broadband_type_1      10
broadband_type_7       9
broadband_type_9       4
broadband_type_3       4
broadband_type_5       3
broadband_type_10      1
Name: broadband_type, dtype: int64

From the above plot, it is clear that broadband type 8 accounts for the most long outages.

Merging Train Data with Server Data

In [162]:
train_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/train_data.csv')
server_data_ = pd.read_csv('/Users/nadeemm/Desktop/PHD B78/data/server_data.csv')
In [163]:
train_server_combined_1 = pd.merge(train_data_,server_data_,left_on = ['id'],
                               right_on = ['id'],how='left')
In [164]:
train_server_combined_1
Out[164]:
id area_code outage_duration transit_server_type
0 13366 area_415 1 transit_server_type_35
1 6783 area_474 0 transit_server_type_35
2 6783 area_474 0 transit_server_type_34
3 9519 area_931 1 transit_server_type_15
4 10202 area_700 1 transit_server_type_11
... ... ... ... ...
9963 10475 area_821 0 transit_server_type_11
9964 10675 area_798 2 transit_server_type_15
9965 14714 area_210 0 transit_server_type_35
9966 14714 area_210 0 transit_server_type_34
9967 6253 area_1007 0 transit_server_type_15

9968 rows × 4 columns

Creating a mask for long outages (duration 2)

In [165]:
long_outage_3 = train_server_combined_1['outage_duration']==2
long_outage_3.head()
Out[165]:
0    False
1    False
2    False
3    False
4    False
Name: outage_duration, dtype: bool
In [166]:
long_outage_3_=train_server_combined_1[long_outage_3]
In [167]:
long_outage_3_.head()
Out[167]:
id area_code outage_duration transit_server_type
5 4555 area_600 2 transit_server_type_15
23 1373 area_962 2 transit_server_type_15
24 1373 area_962 2 transit_server_type_11
35 4209 area_1086 2 transit_server_type_15
47 8984 area_976 2 transit_server_type_15

Plotting Server type with Outage type

In [168]:
server_outage=long_outage_3_.transit_server_type.value_counts()
server_outage.head(10)
server_outage.head(10).plot(kind="bar",figsize=(10,10))
Out[168]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a18cd9550>

Recommendation: outage-duration detection can be improved by also examining the reports (log_report_type) generated by customers.

Merging Data

In [173]:
train_broadband_outage.shape
Out[173]:
(6754, 8)
In [174]:
train_data.head(), outage_data.head(), broadband_data.head(), server_data.head(), report_data.head()
Out[174]:
(      id  area_code  outage_duration
 0  13366        415                1
 1   6783        474                0
 2   9519        931                1
 3  10202        700                1
 4   4555        600                2,
      id  outage_type
 0  6597            2
 1  8011            2
 2  2597            2
 3  5022            1
 4  6852            1,
      id  broadband_type  broadband_shifted_up  is_next_broadband_type_repeat  \
 0  6597               8                   8.0                              1   
 1  8011               8                   8.0                              1   
 2  2597               8                   8.0                              1   
 3  5022               8                   8.0                              1   
 4  6852               8                   8.0                              1   
 
    is_next_broadband_type_different  switches_broadband_type  
 0                                 0                        0  
 1                                 0                        0  
 2                                 0                        0  
 3                                 0                        0  
 4                                 0                        0  ,
      id  transit_server_type
 0  6597                   11
 1  8011                   15
 2  2597                   15
 3  5022                   15
 4  5022                   11,
    count_of_log_report_seen    id  log_report_type  volume  \
 0                         0  6597               68       6   
 1                         1  8011               68       7   
 2                         2  2597               68       1   
 3                         3  5022              172       2   
 4                         4  5022               56       1   
 
    num_ids_with_log_feature  binned_log_report  binned_offset_log_report  \
 0                      1093                  8                         8   
 1                      1093                  8                         8   
 2                      1093                  8                         8   
 3                       214                 19                        18   
 4                       244                  7                         7   
 
    position_of_log_report  
 0                       1  
 1                       1  
 2                       1  
 3                       1  
 4                       2  )
In [181]:
train_data.shape, broadband_data.shape, outage_data.shape, report_data.shape, server_type_dummies.shape
Out[181]:
((5904, 3), (21076, 6), (18552, 2), (58671, 8), (31170, 54))

Merge 01 : Train with Server Data

In [182]:
train_server_merge_1 = pd.merge(train_data,server_type_dummies,left_on = ['id'],
                               right_on = ['id'],how='left')
In [183]:
train_server_merge_1.head(),train_server_merge_1.shape
Out[183]:
(      id  area_code  outage_duration  transit_server_type_1  \
 0  13366        415                1                      0   
 1   6783        474                0                      0   
 2   6783        474                0                      0   
 3   9519        931                1                      0   
 4  10202        700                1                      0   
 
    transit_server_type_2  transit_server_type_3  transit_server_type_4  \
 0                      0                      0                      0   
 1                      0                      0                      0   
 2                      0                      0                      0   
 3                      0                      0                      0   
 4                      0                      0                      0   
 
    transit_server_type_5  transit_server_type_6  transit_server_type_7  ...  \
 0                      0                      0                      0  ...   
 1                      0                      0                      0  ...   
 2                      0                      0                      0  ...   
 3                      0                      0                      0  ...   
 4                      0                      0                      0  ...   
 
    transit_server_type_45  transit_server_type_46  transit_server_type_47  \
 0                       0                       0                       0   
 1                       0                       0                       0   
 2                       0                       0                       0   
 3                       0                       0                       0   
 4                       0                       0                       0   
 
    transit_server_type_48  transit_server_type_49  transit_server_type_50  \
 0                       0                       0                       0   
 1                       0                       0                       0   
 2                       0                       0                       0   
 3                       0                       0                       0   
 4                       0                       0                       0   
 
    transit_server_type_51  transit_server_type_52  transit_server_type_53  \
 0                       0                       0                       0   
 1                       0                       0                       0   
 2                       0                       0                       0   
 3                       0                       0                       0   
 4                       0                       0                       0   
 
    transit_server_type_54  
 0                       0  
 1                       0  
 2                       0  
 3                       0  
 4                       0  
 
 [5 rows x 56 columns],
 (9968, 56))

Merge 02 : Train with Outage Data

In [184]:
train_outage_merge_02 = pd.merge(train_data,outage_data,left_on = ['id'],
                               right_on = ['id'],how='left')
In [185]:
train_outage_merge_02.head(), train_outage_merge_02.shape
Out[185]:
(      id  area_code  outage_duration  outage_type
 0  13366        415                1            4
 1   6783        474                0            2
 2   9519        931                1            2
 3  10202        700                1            1
 4   4555        600                2            2,
 (5904, 4))

Merge 03 : Train with Report Data

In [186]:
train_report_merge_03 = pd.merge(train_data,report_data,left_on = ['id'],
                               right_on = ['id'],how='left')
In [187]:
train_report_merge_03.head(),train_report_merge_03.shape
Out[187]:
(      id  area_code  outage_duration  count_of_log_report_seen  \
 0  13366        415                1                     23718   
 1   6783        474                0                     27710   
 2   6783        474                0                     27711   
 3   6783        474                0                     27712   
 4   6783        474                0                     27713   
 
    log_report_type  volume  num_ids_with_log_feature  binned_log_report  \
 0              312       1                      5267                 33   
 1              312       2                      5267                 33   
 2              233       1                      1901                 25   
 3              232       1                      4754                 25   
 4              315       2                      1495                 33   
 
    binned_offset_log_report  position_of_log_report  
 0                        32                       1  
 1                        32                       1  
 2                        24                       2  
 3                        24                       3  
 4                        32                       4  ,
 (18995, 10))

Merge 04 : Train Data with Broadband Data

In [188]:
train_broadband_merge_04 = pd.merge(train_data,broadband_data,left_on = ['id'],
                               right_on = ['id'],how='left')
In [189]:
train_broadband_merge_04.head(), train_broadband_merge_04.shape
Out[189]:
(      id  area_code  outage_duration  broadband_type  broadband_shifted_up  \
 0  13366        415                1               2                   2.0   
 1   6783        474                0               2                   2.0   
 2   9519        931                1               8                   8.0   
 3  10202        700                1               8                   8.0   
 4   4555        600                2               8                   8.0   
 
    is_next_broadband_type_repeat  is_next_broadband_type_different  \
 0                              1                                 0   
 1                              1                                 0   
 2                              1                                 0   
 3                              1                                 0   
 4                              1                                 0   
 
    switches_broadband_type  
 0                        0  
 1                        0  
 2                        0  
 3                        0  
 4                        0  ,
 (6754, 8))

Merging with Train, Server and Outage Individually

In [190]:
train_server_outage = pd.merge(train_server_merge_1,train_outage_merge_02,left_on = ['id'],
                               right_on = ['id'],how='left')
In [191]:
train_server_outage.head(), train_server_outage.shape
Out[191]:
(      id  area_code_x  outage_duration_x  transit_server_type_1  \
 0  13366          415                  1                      0   
 1   6783          474                  0                      0   
 2   6783          474                  0                      0   
 3   9519          931                  1                      0   
 4  10202          700                  1                      0   
 
    transit_server_type_2  transit_server_type_3  transit_server_type_4  \
 0                      0                      0                      0   
 1                      0                      0                      0   
 2                      0                      0                      0   
 3                      0                      0                      0   
 4                      0                      0                      0   
 
    transit_server_type_5  transit_server_type_6  transit_server_type_7  ...  \
 0                      0                      0                      0  ...   
 1                      0                      0                      0  ...   
 2                      0                      0                      0  ...   
 3                      0                      0                      0  ...   
 4                      0                      0                      0  ...   
 
    transit_server_type_48  transit_server_type_49  transit_server_type_50  \
 0                       0                       0                       0   
 1                       0                       0                       0   
 2                       0                       0                       0   
 3                       0                       0                       0   
 4                       0                       0                       0   
 
    transit_server_type_51  transit_server_type_52  transit_server_type_53  \
 0                       0                       0                       0   
 1                       0                       0                       0   
 2                       0                       0                       0   
 3                       0                       0                       0   
 4                       0                       0                       0   
 
    transit_server_type_54  area_code_y  outage_duration_y  outage_type  
 0                       0          415                  1            4  
 1                       0          474                  0            2  
 2                       0          474                  0            2  
 3                       0          931                  1            2  
 4                       0          700                  1            1  
 
 [5 rows x 59 columns],
 (9968, 59))
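The row count here (9,968) exceeds the 6,754 seen above, which is the usual sign of repeated ids on both sides of a merge: pandas emits the per-key Cartesian product, m left rows × n right rows for each shared id. A toy illustration:

```python
import pandas as pd

# Two frames that both repeat the same id (made-up values).
left = pd.DataFrame({"id": [6783, 6783], "server": ["a", "b"]})
right = pd.DataFrame({"id": [6783, 6783], "outage_type": [2, 2]})

# Each of the 2 left rows pairs with each of the 2 right rows for id 6783.
combined = pd.merge(left, right, on="id", how="left")
print(combined.shape)  # (4, 3): 2 x 2 rows for the duplicated key
```

This is why later merges inflate the table further; it is worth checking that the duplication is intended before modelling.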

Merging Train Data, Broadband Data, Report Data

In [192]:
train_broadband_report = pd.merge(train_broadband_merge_04, train_report_merge_03, on='id', how='left')
In [193]:
train_broadband_report.head(), train_broadband_report.shape
Out[193]:
(      id  area_code_x  outage_duration_x  broadband_type  \
 0  13366          415                  1               2   
 1   6783          474                  0               2   
 2   6783          474                  0               2   
 3   6783          474                  0               2   
 4   6783          474                  0               2   
 
    broadband_shifted_up  is_next_broadband_type_repeat  \
 0                   2.0                              1   
 1                   2.0                              1   
 2                   2.0                              1   
 3                   2.0                              1   
 4                   2.0                              1   
 
    is_next_broadband_type_different  switches_broadband_type  area_code_y  \
 0                                 0                        0          415   
 1                                 0                        0          474   
 2                                 0                        0          474   
 3                                 0                        0          474   
 4                                 0                        0          474   
 
    outage_duration_y  count_of_log_report_seen  log_report_type  volume  \
 0                  1                     23718              312       1   
 1                  0                     27710              312       2   
 2                  0                     27711              233       1   
 3                  0                     27712              232       1   
 4                  0                     27713              315       2   
 
    num_ids_with_log_feature  binned_log_report  binned_offset_log_report  \
 0                      5267                 33                        32   
 1                      5267                 33                        32   
 2                      1901                 25                        24   
 3                      4754                 25                        24   
 4                      1495                 33                        32   
 
    position_of_log_report  
 0                       1  
 1                       1  
 2                       2  
 3                       3  
 4                       4  ,
 (22679, 17))
Final Merge
In [194]:
final_merge = pd.merge(train_server_outage, train_broadband_report, on='id', how='left')
In [195]:
final_merge.head()
Out[195]:
      id  area_code_x_x  outage_duration_x_x  transit_server_type_1  \
0  13366            415                    1                      0
1   6783            474                    0                      0
2   6783            474                    0                      0
3   6783            474                    0                      0
4   6783            474                    0                      0

   transit_server_type_2  transit_server_type_3  transit_server_type_4  \
0                      0                      0                      0
1                      0                      0                      0
2                      0                      0                      0
3                      0                      0                      0
4                      0                      0                      0

   transit_server_type_5  transit_server_type_6  transit_server_type_7  ...  \
0                      0                      0                      0  ...
1                      0                      0                      0  ...
2                      0                      0                      0  ...
3                      0                      0                      0  ...
4                      0                      0                      0  ...

   switches_broadband_type  area_code_y_y  outage_duration_y_y  \
0                        0            415                    1
1                        0            474                    0
2                        0            474                    0
3                        0            474                    0
4                        0            474                    0

   count_of_log_report_seen  log_report_type  volume  \
0                     23718              312       1
1                     27710              312       2
2                     27711              233       1
3                     27712              232       1
4                     27713              315       2

   num_ids_with_log_feature  binned_log_report  binned_offset_log_report  \
0                      5267                 33                        32
1                      5267                 33                        32
2                      1901                 25                        24
3                      4754                 25                        24
4                      1495                 33                        32

   position_of_log_report
0                       1
1                       1
2                       2
3                       3
4                       4

[5 rows x 75 columns]

In [196]:
final_merge.shape
Out[196]:
(48973, 75)
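Because the same keys were merged repeatedly, area_code and outage_duration now exist in several suffixed copies (`_x_x`, `_y_y`). One way to consolidate, sketched on a hypothetical mini-frame (in the real data, first verify the copies are identical before dropping):

```python
import pandas as pd

# Hypothetical mini version of final_merge with the duplicated columns.
df = pd.DataFrame({
    "id": [13366, 6783],
    "area_code_x_x": [415, 474],
    "outage_duration_x_x": [1, 0],
    "area_code_y_y": [415, 474],
    "outage_duration_y_y": [1, 0],
})

# Keep one canonical copy under a clean name, drop the suffixed duplicates.
df = df.rename(columns={"area_code_x_x": "area_code",
                        "outage_duration_x_x": "outage_duration"})
df = df.drop(columns=["area_code_y_y", "outage_duration_y_y"])
print(df.columns.tolist())  # ['id', 'area_code', 'outage_duration']
```

Merging on ['id', 'area_code'] together, or passing explicit `suffixes=`, would avoid the `_x_x`/`_y_y` build-up in the first place.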
In [202]:
final_merge.nunique()
Out[202]:
id                          5904
area_code_x_x                876
outage_duration_x_x            3
transit_server_type_1          2
transit_server_type_2          2
                            ... 
volume                       240
num_ids_with_log_feature     129
binned_log_report             39
binned_offset_log_report      39
position_of_log_report        19
Length: 75, dtype: int64
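The per-column nunique() counts are useful for screening features. A small sketch (toy frame, not the real data) that flags constant columns, which carry no signal for the model:

```python
import pandas as pd

# Toy frame: 'flag' is constant and would be useless as a feature.
df = pd.DataFrame({"id": [1, 2, 3], "flag": [0, 0, 0], "x": [5, 7, 5]})

counts = df.nunique()
constant_cols = counts[counts == 1].index.tolist()
print(constant_cols)  # ['flag']
```

The same filter on final_merge would also surface id-like columns (nunique close to the row count) that should be excluded from training.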
In [205]:
type_=pd.crosstab(final_merge['outage_duration_x_x'],final_merge['outage_type']) 
type_.plot(kind="bar", stacked=True, figsize=(8,6))
type_
Out[205]:
outage_type              1      2   3    4   5
outage_duration_x_x
0                    13167  15603  33  520  24
1                     9210   3388   0  197  17
2                     6325    489   0    0   0
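Raw counts are dominated by the class sizes; a row-normalised crosstab makes the outage_type mix within each outage_duration easier to compare. A sketch on toy columns (names assumed to mirror the notebook's):

```python
import pandas as pd

# Toy stand-ins for the two categorical columns.
df = pd.DataFrame({"outage_duration": [0, 0, 1, 1, 2, 2],
                   "outage_type":     [1, 2, 1, 1, 1, 1]})

# normalize='index' turns each row into proportions that sum to 1.
shares = pd.crosstab(df["outage_duration"], df["outage_type"],
                     normalize="index")
print(shares)
```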
In [207]:
dat=pd.crosstab(final_merge['broadband_type'],final_merge['outage_duration_x_x']) 
dat.plot(kind="bar", stacked=True, figsize=(8,6))
dat
Out[207]:
outage_duration_x_x      0     1     2
broadband_type
1                      172    56   253
2                    19003  4615  1456
3                      149   286   126
4                      703   402   364
5                        0     0   110
6                      692   848   498
7                     1052   879   306
8                     7096  5352  3533
9                      279   266    98
10                     201   108    70
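To go beyond eyeballing the stacked bars, a chi-square test of independence on the contingency table quantifies whether broadband_type and outage_duration are associated. Sketch with made-up counts; the real call would take the `dat` table above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of counts (not the notebook's data).
table = pd.DataFrame([[100, 10, 5],
                      [20, 80, 60]])

# Returns the test statistic, p-value, degrees of freedom, and the
# expected counts under independence.
chi2, p, dof, expected = chi2_contingency(table)
print(dof)  # (2-1) * (3-1) = 2
```

A small p-value would support keeping broadband_type as a predictor of outage_duration.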